import pandas as pd
df = pd.read_csv("data/Players2024.csv")
Visualisation
In this third workshop we will cover
- Simple visualisations with seaborn
- Making modifications with matplotlib
- Matplotlib from scratch
- Interactive visualisations with plotly
Setting up
With the data manipulation tools from pandas, we can now visualise our data. For this workshop we’ll be working from the “Players2024.csv” dataset, which we should bring in with pandas:
Take a quick peek at the dataset to remind yourself
print(df)
name birth_date height_cm positions nationality \
0 James Milner 1986-01-04 175.0 Midfield England
1 Anastasios Tsokanis 1991-05-02 176.0 Midfield Greece
2 Jonas Hofmann 1992-07-14 176.0 Midfield Germany
3 Pepe Reina 1982-08-31 188.0 Goalkeeper Spain
4 Lionel Carole 1991-04-12 180.0 Defender France
... ... ... ... ... ...
5930 Oleksandr Pshenychnyuk 2006-05-01 180.0 Midfield Ukraine
5931 Alex Marques 2005-10-23 186.0 Defender Portugal
5932 Tomás Silva 2006-05-25 175.0 Defender Portugal
5933 Fábio Sambú 2007-09-06 180.0 Attack Portugal
5934 Hakim Sulemana 2005-02-19 164.0 Attack Ghana
age club
0 38 Brighton and Hove Albion Football Club
1 33 Volou Neos Podosferikos Syllogos
2 32 Bayer 04 Leverkusen Fußball
3 42 Calcio Como
4 33 Kayserispor Kulübü
... ... ...
5930 18 ZAO FK Chornomorets Odessa
5931 18 Boavista Futebol Clube
5932 18 Boavista Futebol Clube
5933 17 Boavista Futebol Clube
5934 19 Randers Fodbold Club
[5935 rows x 7 columns]
Seaborn for simple visualisations
To begin our visualisations, we’ll use the package seaborn, which allows you to quickly whip up decent graphs.
import seaborn as sns
The name “seaborn” is a reference to the fictional character Samuel Norman Seaborn from The West Wing, whose initials give the alias “sns”.
Seaborn has three plotting functions
# for categorical plotting, e.g. bar plots, box plots etc.
sns.catplot(...)
# for relational plotting, e.g. line plots, scatter plots
sns.relplot(...)
# for distributions, e.g. histograms
sns.displot(...)
We’ll begin with the first.
Categorical plots
Categorical plots are produced with seaborn’s sns.catplot() function. There are two key pieces of information to pass:
- The data
- The variables
Let’s see if there’s a relationship between the players’ heights and positions, by placing their positions on the \(x\) axis and heights on the \(y\).
sns.catplot(data = df, x = "positions", y = "height_cm")
Our first graph! This is called a strip plot (the default kind for sns.catplot()); it’s like a scatter plot for categorical variables.
It’s already revealed two things to us about the data:
- There are some incorrect heights - nobody is shorter than 25cm!
- Someone’s position is “missing”
Let’s get rid of these with the data analysis techniques from last session
# Remove missing position
df = df[df["positions"] != "Missing"]
# Ensure reasonable heights
df = df[df["height_cm"] > 100]
Run the plot again; it’s more reasonable now
sns.catplot(data = df, x = "positions", y = "height_cm")
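To see what the two filtering lines do, here is a minimal sketch on a made-up four-row table (the names and values are invented for illustration and are not from Players2024.csv). Each comparison produces a True/False value per row, and indexing with it keeps only the True rows.

```python
import pandas as pd

# Made-up rows standing in for Players2024.csv (not real data)
players = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "positions": ["Defender", "Missing", "Goalkeeper", "Attack"],
    "height_cm": [180.0, 175.0, 190.0, 18.0],
})

# Each comparison gives a True/False Series with one value per row
has_position = players["positions"] != "Missing"
plausible_height = players["height_cm"] > 100

# Indexing with the combined mask keeps only rows where both are True
cleaned = players[has_position & plausible_height]
print(cleaned)
```

Applying the two conditions one after the other, as in the workshop code, should give the same rows as combining them with & in a single step.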
Bar plots
This default plot type is interesting but not standard. You can change the plot type with the kind parameter
sns.catplot(data = df, x = "positions", y = "height_cm", kind = "bar")
Many aspects of your plot can be adjusted by passing additional parameters, which is where seaborn excels.
It seems like goalkeepers are taller, but not by much. Let’s look at the standard deviation for each position by changing the estimator = parameter (default is mean)
sns.catplot(data = df, x = "positions", y = "height_cm", kind = "bar", estimator = "std")
Clearly there’s a lot less variation in goalkeepers - they’re all tall.
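If you want the numbers behind the bars, a rough equivalent is a pandas groupby. This is a sketch on invented heights, and it assumes that seaborn’s estimator = "std" matches pandas’ sample standard deviation, which is worth checking against your seaborn version.

```python
import pandas as pd

# Invented heights, only to show the shape of the calculation
heights = pd.DataFrame({
    "positions": ["Goalkeeper", "Goalkeeper", "Defender", "Defender", "Defender"],
    "height_cm": [190.0, 192.0, 178.0, 184.0, 190.0],
})

# One row per position: the mean and the standard deviation of the heights
summary = heights.groupby("positions")["height_cm"].agg(["mean", "std"])
print(summary)
```

On the real data you would replace heights with df; the bar heights in the two plots above should correspond to the mean and std columns respectively.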
Box plots
Let’s make box plots instead. It’s the same procedure, just change to kind = "box" and remove estimator =
sns.catplot(data = df, x = "positions", y = "height_cm", kind = "box")
Just as we predicted.
Distributions
Histograms
Let’s move to the “age” variable now. We can look at the distribution of ages with
sns.displot(data = df, x = "age")
Looks a bit funny with those gaps - let’s change the number of bins with bins = 28
sns.displot(data = df, x = "age", bins = 28)
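One likely reason for the gaps, assuming ages are whole numbers, is that there are more bins than distinct ages, so some bins can never receive a value. The sketch below uses a hypothetical helper, histogram_counts, as a simplified stand-in for equal-width binning (it is not seaborn’s code), and the ages are made up.

```python
def histogram_counts(values, bins):
    """Count values in `bins` equal-width bins spanning min(values)..max(values)."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin rather than a new one
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts

# Made-up whole-number ages from 17 to 44: 28 distinct values, 3 players each
ages = [age for age in range(17, 45) for _ in range(3)]

too_many = histogram_counts(ages, bins = 40)
one_per_age = histogram_counts(ages, bins = 28)

# Number of empty bins in each case
print(too_many.count(0), one_per_age.count(0))
```

With 40 bins some bins stay empty, while one bin per distinct age leaves none empty. A bin count that suits this made-up range need not suit the real data.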
Now, what if you wanted to look at the distribution for different groups? We can make a separate distribution for each position with the col = "positions" argument, specifying a new column for each position
sns.displot(data = df, x = "age", bins = 28, col = "positions")
Kernel density estimates
Finally, you don’t have to do histograms. You could also do a Kernel Density Estimate, with kind = "kde" (let’s remove bins = and col =)
sns.displot(data = df, x = "age", kind = "kde")
If you want a separate line for each position, we should indicate that each position needs a different colour/hue with hue = "positions"
sns.displot(data = df, x = "age", hue = "positions", kind = "kde")
Relational plots
It seems like players peak in their mid-twenties, but goalkeepers stay for longer. Let’s see if there’s a relationship between players’ age and height
Scatter plots
We’ll start with a scatter plot
sns.relplot(data = df, x = "age", y = "height_cm")
Not much of a trend there, although the bottom-right looks a bit emptier than the rest (could it be that short old players are the first to retire?).
We can use hue = to have a look at positions again
sns.relplot(data = df, x = "age", y = "height_cm", hue = "positions")
Yup, goalkeepers are tall, and everyone else is a jumble.
Line plots
Let’s do a line plot of the average height per age.
sns.relplot(data = df, x = "age", y = "height_cm", kind = "line")
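The line drawn here is an aggregate: when several rows share the same age, seaborn’s default is to plot their mean, with a shaded band showing the uncertainty. A minimal sketch of that averaging step in pandas, on invented values:

```python
import pandas as pd

# Invented values: two players aged 20 and three aged 21
sample = pd.DataFrame({
    "age": [20, 20, 21, 21, 21],
    "height_cm": [180.0, 184.0, 175.0, 181.0, 184.0],
})

# One mean height per age, which is what the line passes through
mean_height_by_age = sample.groupby("age")["height_cm"].mean()
print(mean_height_by_age)
```

Ages with only a handful of players give means that jump around, which is one reason the ends of a line like this can look odd.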
Seems pretty flat, except the ends are a bit weird because there’s not much data. Let’s keep only the ages strictly between 17 and 38 and plot it
# Create smaller dataframe
condition = (df["age"] > 17) & (df["age"] < 38)
inner_ages = df[condition]
# Line plot
sns.relplot(data = inner_ages, x = "age", y = "height_cm", kind = "line")
Looks a bit shaky but that’s just because it’s zoomed in - notice that we go from 182cm to 184cm. We’ll fix this when we look at matplotlib in the next section.
Combining the two
We can combine our scatter and line plots together.
- Make the first plot as normal
- For all additional (overlaying) plots, use an axes-level plot instead of sns.relplot() etc. These just draw the points/bars/lines, and normally work behind the scenes. There’s one for every plot type, and they look like
  - sns.lineplot()
  - sns.scatterplot()
  - sns.boxplot()
  - sns.histplot()
  - etc.
For example,
# Figure level plot
sns.relplot(data = df, x = "age", y = "height_cm", hue = "positions")
# Axes level plot (drop the kind = )
sns.lineplot(data = inner_ages, x = "age", y = "height_cm")
You can’t include kind = inside an axes level plot
Let’s swap the colour variable from the scatter plot to the line plot
# Figure level plot
sns.relplot(data = df, x = "age", y = "height_cm")
# Axes level plot (drop the kind = )
sns.lineplot(data = inner_ages, x = "age", y = "height_cm", hue = "positions")
Finally, let’s make the scatter dots smaller with s = 10 and grey with color = "grey".
# Figure level plot
sns.relplot(data = df, x = "age", y = "height_cm", s = 10, color = "grey")
# Axes level plot (drop the kind = )
sns.lineplot(data = inner_ages, x = "age", y = "height_cm", hue = "positions")
Going deeper with matplotlib
Seaborn is great for simple and initial visualisations, but when you need to make adjustments it gets tricky. At its core, seaborn is just a simple way of using matplotlib, an extensive and popular plotting package. It was created as a way of doing MATLAB visualisations with Python, so if you’re coming from there, things will feel familiar.
Pros
- Customisable. You can tweak almost every parameter of the visualisations
- Fast. It can handle large data
- Popular. Lots of people use it, and knowing it will help you collaborate
Cons - a bit programmy
- Steep-ish learning curve. Creating basic plots can be easy, but it’s set up with enough complexity that it takes a bit of work to figure out what’s going on.
- Cumbersome. You can tweak almost everything, but this means that it can take some effort to tweak anything.
We’re barely going to touch the matplotlib surface, but we’ll look at some essentials.
To begin with, we want to bring in matplotlib as follows
import matplotlib.pyplot as plt
Saving plots
Before we move to adjusting the plot, let’s just look at how you save it. While you can do this with seaborn, the matplotlib way is also very simple.
As a first step, you should make a new folder. Navigate using your file explorer to the project and create a new folder called “plots”.
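If you would rather create the folder from Python than from the file explorer, a minimal sketch with the standard library is shown below. It assumes Python is running from your project folder, since the path is relative.

```python
from pathlib import Path

# "plots" is the folder name used in this workshop; the path is relative
# to wherever Python is currently running (assumed to be the project folder)
plots_dir = Path("plots")

# exist_ok = True means no error is raised if the folder is already there
plots_dir.mkdir(exist_ok = True)
print(plots_dir.resolve())
```

The printed path shows where the folder ended up, which helps if your working directory is not what you expected.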
Next, save the current plot with plt.savefig("place_location_here"), and we have to do this at the same time that we make the plot. So run all this code at once:
plt.savefig("plots/first_saved_plot.png")
Making modifications
Titles
Notice that the \(y\) axis has an ugly label? That’s because seaborn is just drawing from your dataframe.
We can change axis labels with plt.ylabel()
# Plotting functions
sns.relplot(data = df, x = "age", y = "height_cm", s = 10, color = "grey")
sns.lineplot(data = inner_ages, x = "age", y = "height_cm", hue = "positions")
# Customisation
plt.ylabel("Height (cm)")
Text(4.8166666666666655, 0.5, 'Height (cm)')
and similarly you could change plt.xlabel(...).
Make sure you run the above line at the same time as your plotting function. You can either
- Highlight all the code and press F9
- Make a cell with #%% and press ctrl + enter
We can also change the legend title to “positions” with plt.legend()
# Plotting functions
sns.relplot(data = df, x = "age", y = "height_cm", s = 10, color = "grey")
sns.lineplot(data = inner_ages, x = "age", y = "height_cm", hue = "positions")
# Customisation
plt.ylabel("Height (cm)")
plt.legend(title = "positions")
And its location with loc = "lower left"
# Plotting functions
sns.relplot(data = df, x = "age", y = "height_cm", s = 10, color = "grey")
sns.lineplot(data = inner_ages, x = "age", y = "height_cm", hue = "positions")
# Customisation
plt.ylabel("Height (cm)")
plt.legend(title = "positions", loc = "lower left")
And give the whole plot a title with plt.title()
# Figure level plot
sns.relplot(data = df, x = "age", y = "height_cm", s = 10, color = "grey")
# Axes level plot (drop the kind = )
sns.lineplot(data = inner_ages, x = "age", y = "height_cm", hue = "positions")
# Titles
plt.ylabel("Height (cm)")
plt.legend(title = "positions")
plt.title("Players' heights vs ages")
Text(0.5, 1.0, "Players' heights vs ages")
Annotations
You might want to annotate your plot with text and arrows. Text is simple with the plt.text() function; we just need to specify its coordinates and the contents.
# Figure level plot
sns.relplot(data = df, x = "age", y = "height_cm", s = 10, color = "grey")
# Axes level plot (drop the kind = )
sns.lineplot(data = inner_ages, x = "age", y = "height_cm", hue = "positions")
# Titles
plt.ylabel("Height (cm)")
plt.legend(title = "positions")
plt.title("Players' heights vs ages")
# Annotations
plt.text(38.5, 181, "Not enough\ndata for mean")
Text(38.5, 181, 'Not enough\ndata for mean')
The characters \n mean ‘new line’
We could annotate with arrows too. This is more complex, using the plt.annotate() function:
# Figure level plot
sns.relplot(data = df, x = "age", y = "height_cm", s = 10, color = "grey")
# Axes level plot (drop the kind = )
sns.lineplot(data = inner_ages, x = "age", y = "height_cm", hue = "positions")
# Titles
plt.ylabel("Height (cm)")
plt.legend(title = "positions")
plt.title("Players' heights vs ages")
# Annotations
plt.text(38.5, 181, "Not enough\ndata for mean")
plt.annotate(text = "No short\nolder players", xy = [37,165], xytext = [40,172],
             arrowprops = dict(width = 1, headwidth = 10, headlength = 10,
                               facecolor = "black"))
Text(40, 172, 'No short\nolder players')
I’ve split this over multiple lines, but it’s still one function - check the brackets
All together, our plot has become
Axis limits
The last feature we’ll look at is editing axis limits. Let’s try to make more room in the bottom left for the legend with the functions plt.xlim() and plt.ylim()
# Figure level plot
sns.relplot(data = df, x = "age", y = "height_cm", s = 10, color = "grey")
# Axes level plot (drop the kind = )
sns.lineplot(data = inner_ages, x = "age", y = "height_cm", hue = "positions")
# Titles
plt.ylabel("Height (cm)")
plt.legend(title = "positions", loc = "lower left")
plt.title("Players' heights vs ages")
# Annotations
plt.text(38.5, 181, "Not enough\ndata for mean")
plt.annotate("No short\nolder players", [37,165], [40,172],
             arrowprops = dict(width = 1,headwidth = 10,headlength = 10,
                               facecolor = "black"))
# Axis limits
plt.xlim([10,45])
plt.ylim([150,210])
I’m not sure that looks any better, but you get the idea!
Interactivity with plotly
For the last part of this section, we’re going to briefly look at making interactive plots with plotly.
We bring in the tools with
import plotly.express as px
You’ll probably need to install it first - use either
conda install plotly
OR
pip install plotly
depending on your installation.
Plotly works by creating a visualisation like we’ve been doing, and then loading it into something dynamic, like a web browser. Spyder does not support interactive plots. This means we need to change the default settings with
import plotly.io as pio
pio.renderers.default = "browser"
Now, plots should all load in your default browser.
The basics
We make plotly graphs very similarly to seaborn. Let’s take our first plot from above,
sns.relplot(data = df, x = "age", y = "height_cm", s = 10, color = "grey")
and turn it into a plotly one.
- We need to use px.scatter instead of sns.relplot
- We need to use data_frame = instead of data =
- Let’s remove the s = and color = for now
- Save the plot as a variable
px.scatter(data_frame = df, x = "age", y = "height_cm")
Notice how you can hover over the points now? It’s interactive!
Introducing more info and neatening up
Like seaborn’s “hue”, we can use color = to introduce a third variable
px.scatter(data_frame = df, x = "age", y = "height_cm", color = "positions")
And like seaborn’s “col”, we can facet with facet_col =
px.scatter(data_frame = df, x = "age", y = "height_cm", color = "positions",
           facet_col = "positions")
Personally, I think these are too squished. We can specify the maximum number of columns with facet_col_wrap =
px.scatter(data_frame = df, x = "age", y = "height_cm", color = "positions",
           facet_col = "positions", facet_col_wrap = 2)
Finally, let’s adjust the information in the hover. We can give each point a name with hover_name = - how about their actual names?
px.scatter(data_frame = df, x = "age", y = "height_cm", color = "positions",
           facet_col = "positions", facet_col_wrap = 2, hover_name = "name")
And let’s also include their nationalities
px.scatter(data_frame = df, x = "age", y = "height_cm", color = "positions",
           facet_col = "positions", facet_col_wrap = 2, hover_name = "name",
           hover_data = "nationality")
Saving interactive plots
Since these are interactive, we can’t save them as normal. The easiest option is to save them as HTML files - like websites - which we can open from our browsers.
First, save the plot into a variable
fig = px.scatter(data_frame = df, x = "age", y = "height_cm", color = "positions",
                 facet_col = "positions", facet_col_wrap = 2, hover_name = "name",
                 hover_data = "nationality")
Then, write it to HTML
fig.write_html("plot.html")